
10 research outputs found


MapReduce based RDF assisted distributed SVM for high throughput spam filtering

Author: Caruana Godwin
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2013

This thesis was submitted for the degree of Doctor of Philosophy and was awarded by Brunel University. Electronic mail has become cast and embedded in our everyday lives. Billions of legitimate emails are sent on a daily basis. The widely established underlying infrastructure, its widespread availability as well as its ease of use have all acted as catalysts to such pervasive proliferation. Unfortunately, the same can be alleged about unsolicited bulk email, or rather spam. Various methods, as well as enabling architectures, are available to try to mitigate spam permeation. In this respect, this dissertation complements existing survey work in this area by contributing an extensive literature review of traditional and emerging spam filtering approaches. Techniques, approaches and architectures employed for spam filtering are appraised, critically assessing respective strengths and weaknesses. Velocity, volume and variety are key characteristics of the spam challenge. MapReduce (M/R) has become increasingly popular as an Internet scale, data intensive processing platform. In the context of machine learning based spam filter training, support vector machine (SVM) based techniques have been proven effective. SVM training is however a computationally intensive process. In this dissertation, an M/R based distributed SVM algorithm for scalable spam filter training, designated MRSMO, is presented. By distributing and processing subsets of the training data across multiple participating computing nodes, the distributed SVM reduces spam filter training time significantly. To mitigate the accuracy degradation introduced by the adopted approach, a Resource Description Framework (RDF) based feedback loop is evaluated. Experimental results demonstrate that this improves the accuracy levels of the distributed SVM beyond the original sequential counterpart.
Effectively exploiting large scale, ‘Cloud’ based, heterogeneous processing capabilities for M/R in what can be considered a non-deterministic environment requires the consideration of a number of perspectives. In this work, gSched, a Hadoop M/R based, heterogeneous aware task-to-node matching and allocation scheme, is designed. Using MRSMO as a baseline, experimental evaluation indicates that gSched improves on the performance of the out-of-the-box Hadoop counterpart in a typical Cloud based infrastructure. The focal contribution to knowledge is a scalable, heterogeneous infrastructure and machine learning based spam filtering scheme, able to capitalize on collaborative accuracy improvements through RDF based, end user feedback.

Brunel University Research Archive

An ontology enhanced parallel SVM for scalable spam filter training

Author: Bauer
Blanco
Blanzieri
Blei
Breiman
Cao
Caruana
Chawla
Colas
Cristianini
Dean
Do
Gansterer
Godwin Caruana
Graf
Hall
Huang
Kearns
Kim
Maozhen Li
Mei
Platt
Suykens
Taura
Vapnik
Wang
Woodsend
Yang Liu
Zanghirati
Zhang
Publication venue: 'Elsevier BV'
Publication date: 01/05/2013

This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V. Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart.

Crossref

Brunel University Research Archive

EUROMOD update : feasibility study : Malta (Tax-Benefit Systems 2007-2010)

Author: Caruana Etienne
Gravino Daniel
Mercieca Pauline
Micallef Frank
Mifsud Godwin
Miljanic Brinkworth Maja
Vella Kevin J.
Publication venue: European Centre for Social Welfare Policy and Research
Publication date: 01/03/2011

The purpose of this study is to examine the technical feasibility of applying a micro-simulation model to analyse the impact of policy on social integration from the national as well as from the EU perspective. This is the first time that Malta’s tax-benefit system has been analysed from the angle of its main elements, that is, the policy rules underlying the entitlement criteria that define them. This was an opportunity for the main players in this field to work in synergy on this vital issue: the Ministry for the Family and Social Solidarity, in charge of social benefits; the Ministry of Finance, responsible for fiscal policy and the income tax system in particular; and the National Statistics Office, tasked with income data collection based on the EU-SILC methodology. This Feasibility Study describes the situation as it was in 2007 and the major changes that took place in 2008, 2009 and 2010. Firstly, the study describes the main elements of the tax-benefit system, namely income, income tax brackets, capital resources and Social Security contributions. The second section of the study illustrates the main sources of data to be used for modelling purposes and also shows examples of the calculation of income tax and social benefits. It has been agreed that the EU-SILC 2008 data would be used for the income element, since Malta joined this system of data collection back in 2005. The third section of the study firstly outlines the qualities and limitations of the input data set. This section also focuses on specificities of Malta’s data collection and possible difficulties regarding model application. The study points at the possible combinations of sample and population databases. Also, simulation possibilities have been specified for both systems separately. Finally, the non-take up of benefit and the issue of tax and benefit fraud illustrate the situation and the possible unknown element on both sides. Peer-reviewed.

OAR@UM

A Distributed Anti-Spam Architecture using Plug-ins

Author: Caruana Godwin
Darbyshire Paul
Publication venue: Information Institute

Exploration of the factors that influence new Australian dental graduates to work rurally and their perspectives of rural versus metropolitan employment

Author: Australian Health Practitioner Regulation Agency
Australian Institute of Health and Welfare
Bazen JJ
Campbell N
Caruana EJ
Gamm L
Godwin D
Grobler L
Hall D
Jamar E
Jamieson JL
Johnson G
Johnson GE
Kruger E
Lalloo R
McFarland KK
Renner DM
Richards HM
Schoo AM
Sengupta TK
Srivastava A
Wilson M
Wilson NW
Publication venue: 'Wiley'

Crossref